data valuation
CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification
Data valuation, or the valuation of individual datum contributions, has seen growing interest in machine learning due to its demonstrable efficacy for tasks such as noisy label detection. In particular, due to its desirable axiomatic properties, several Shapley value approximations have been proposed. In these methods, the value function is usually defined as the predictive accuracy over the entire development set. However, this limits the ability to differentiate between training instances that are helpful or harmful to their own classes. Intuitively, instances that harm their own classes may be noisy or mislabeled and should receive a lower valuation than helpful instances. In this work, we propose CS-Shapley, a Shapley value with a new value function that discriminates between training instances' in-class and out-of-class contributions. Our theoretical analysis shows the proposed value function is (essentially) the unique function that satisfies two desirable properties for evaluating data values in classification. Further, our experiments on two benchmark evaluation tasks (data removal and noisy label detection) and four classifiers demonstrate the effectiveness of CS-Shapley over existing methods. Lastly, we evaluate the "transferability" of data values estimated from one classifier to others, and our results suggest Shapley-based data valuation is transferable for application across different models.
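Most Shapley-based valuation methods, CS-Shapley included, rest on the same permutation-sampling recipe: average a training point's marginal contribution to a value function over many random orderings of the training set. The sketch below illustrates that recipe with a class-wise utility. It assumes NumPy arrays and a scikit-learn classifier, and the in-class-minus-out-of-class accuracy used here is only a simplified stand-in for the paper's actual value function, not its exact definition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def class_wise_utility(model, X_val, y_val, target_class):
    # In-class accuracy minus out-of-class accuracy: a simplified stand-in for
    # the paper's class-wise value function (the exact definition differs).
    preds = model.predict(X_val)
    in_mask = (y_val == target_class)
    in_acc = (preds[in_mask] == y_val[in_mask]).mean() if in_mask.any() else 0.0
    out_acc = (preds[~in_mask] == y_val[~in_mask]).mean() if (~in_mask).any() else 0.0
    return in_acc - out_acc

def class_wise_mc_shapley(X_tr, y_tr, X_val, y_val, n_perms=20, seed=0):
    # Permutation-sampling (Monte Carlo) Shapley estimate: in each random
    # ordering, a point's marginal contribution is the change in utility for
    # its OWN class when it joins the growing prefix.
    rng = np.random.default_rng(seed)
    n = len(X_tr)
    values = np.zeros(n)
    classes = np.unique(y_tr)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev = {c: 0.0 for c in classes}           # utilities of the empty prefix
        for k in range(1, n + 1):
            prefix, i = perm[:k], perm[k - 1]
            if len(np.unique(y_tr[prefix])) < 2:   # classifier needs >= 2 classes
                continue
            model = LogisticRegression(max_iter=500).fit(X_tr[prefix], y_tr[prefix])
            cur = {c: class_wise_utility(model, X_val, y_val, c) for c in classes}
            values[i] += cur[y_tr[i]] - prev[y_tr[i]]
            prev = cur
    return values / n_perms
```

Under this scheme, instances that repeatedly lower the utility of their own class accumulate negative values, which is how noisy or mislabeled points surface in the downstream removal and noisy-label-detection evaluations.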
Data Valuation for LLM Fine-Tuning: Efficient Shapley Value Approximation via Language Model Arithmetic
Tamine, Mélissa, Sakhi, Otmane, Heymann, Benjamin
Data is a critical asset for training large language models (LLMs), alongside compute resources and skilled workers. While some training data is publicly available, substantial investment is required to generate proprietary datasets, such as human preference annotations, or to curate new ones from existing sources. As larger datasets generally yield better model performance, two natural questions arise. First, how can data owners make informed decisions about curation strategies and investment in data sources? Second, how can multiple data owners collaboratively pool their resources to train superior models while fairly distributing the benefits? This problem of data valuation, which is not specific to large language models, has been addressed by the machine learning community through the lens of cooperative game theory, with the Shapley value being the prevalent solution concept. However, computing Shapley values is notoriously expensive for data valuation, typically requiring numerous model retrainings, which can become prohibitive for large machine learning models. In this work, we demonstrate that this computational challenge is dramatically simplified for LLMs trained with Direct Preference Optimization (DPO). We show how the specific mathematical structure of DPO enables scalable Shapley value computation. We believe this observation unlocks many applications at the intersection of data valuation and large language models.
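To see why retraining cost is the bottleneck, the toy sketch below enumerates the exact Shapley value over a handful of data owners. It is not the paper's DPO-based method; the owner names, dataset sizes, and placeholder utility are illustrative only, and in the naive setting each utility call would correspond to a full fine-tuning run, which is precisely the cost the paper's approach avoids.

```python
from itertools import combinations
from math import factorial

def exact_shapley(owners, utility):
    # Exact Shapley value over data owners: for each owner, average the marginal
    # contribution over every coalition of the remaining owners with the standard
    # combinatorial weights. The utility is called O(2^n) times, so with a
    # retraining-based utility this is prohibitive beyond a few owners.
    n = len(owners)
    values = {o: 0.0 for o in owners}
    for o in owners:
        rest = [p for p in owners if p != o]
        for k in range(n):
            for coalition in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                marginal = utility(set(coalition) | {o}) - utility(set(coalition))
                values[o] += weight * marginal
    return values

# Hypothetical usage: the "utility" of a coalition would in practice be the
# validation reward of a model fine-tuned on the coalition's pooled preference
# data; here a toy function of pooled dataset size stands in for it.
dataset_sizes = {"owner_a": 10_000, "owner_b": 4_000, "owner_c": 1_000}
toy_utility = lambda coalition: sum(dataset_sizes[o] for o in coalition) ** 0.5
print(exact_shapley(list(dataset_sizes), toy_utility))
```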
Lightweight Time Series Data Valuation on Time Series Foundation Models via In-Context Finetuning
Wu, Shunyu, Li, Tianyue, Leng, Yixuan, Suo, Jingyi, Lou, Jian, Li, Dan, Ng, See-Kiong
Time series foundation models (TSFMs) have demonstrated increasing capabilities due to their extensive pretraining on large volumes of diverse time series data. Consequently, the quality of time series data is crucial to TSFM performance, rendering an accurate and efficient data valuation of time series for TSFMs indispensable. However, traditional data valuation methods, such as influence functions, face severe computational bottlenecks due to their poor scalability with growing TSFM model sizes and often fail to preserve temporal dependencies. In this paper, we propose LTSV, a Lightweight Time Series Valuation on TSFMs via in-context finetuning. Grounded in the theoretical evidence that in-context finetuning approximates the influence function, LTSV estimates a sample's contribution by measuring the change in context loss after in-context finetuning, leveraging the strong generalization capabilities of TSFMs to produce robust and transferable data valuations. To capture temporal dependencies, we introduce temporal block aggregation, which integrates per-block influence scores across overlapping time windows. Experiments across multiple time series datasets and models demonstrate that LTSV consistently provides reliable and strong valuation performance, while maintaining manageable computational requirements. Our results suggest that in-context finetuning on time series foundation models provides a practical and effective bridge between data attribution and model generalization in time series learning.
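The sketch below outlines the general shape of such a valuation under stated assumptions: a generic `forecast_loss(model, context, target)` callable stands in for the TSFM's loss (no real TSFM API is implied), block scores use a change-in-context-loss proxy rather than the paper's exact in-context finetuning update, and the window parameters are illustrative.

```python
import numpy as np

def block_value(forecast_loss, model, candidate, query_context, query_target):
    # Hypothetical interface: forecast_loss(model, context, target) returns the
    # model's forecasting loss on `target` given `context`. The candidate block's
    # value is the drop in that loss when the block is prepended to the context,
    # i.e. a change-in-context-loss proxy for its influence.
    base = forecast_loss(model, query_context, query_target)
    augmented = forecast_loss(model, np.concatenate([candidate, query_context]), query_target)
    return base - augmented  # > 0: the block helps; < 0: it hurts

def temporal_block_valuation(forecast_loss, model, series, query_context, query_target,
                             block_len=64, stride=32):
    # Slide overlapping windows over the candidate series, score each block, and
    # average per-timestep scores across the blocks covering it (temporal block
    # aggregation in spirit; the paper's exact aggregation may differ).
    scores = np.zeros(len(series))
    counts = np.zeros(len(series))
    for start in range(0, len(series) - block_len + 1, stride):
        block = series[start:start + block_len]
        v = block_value(forecast_loss, model, block, query_context, query_target)
        scores[start:start + block_len] += v
        counts[start:start + block_len] += 1
    return scores / np.maximum(counts, 1)
```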